1 SIMP (Single Instruction stream/Multiple instruction Pipelining): a novel high-speed
single-processor architecture

K. Murakami, N. Irie, S. Tomita 

April 1989 ACM SIGARCH Computer Architecture News , Proceedings of the 16th 

annual international symposium on Computer architecture ISCA '89, volume 

17 Issue 3 

Publisher: ACM Press 

Additional Information: full citation, abstract, references, citings, index 
terms 



Full text available: pdf(1.23 MB)



SIMP is a novel multiple instruction-pipeline parallel architecture. It is targeted for 
enhancing the performance of SISD processors drastically by exploiting both temporal and 
spatial parallelisms, and for keeping program compatibility as well. Degree of 
performance enhancement achieved by SIMP depends on; i) how to supply multiple 
instructions continuously, and ii) how to resolve data and control dependencies 
effectively. We have devised the outstanding techniques for instruction fetch an ... 

2 Facilitating superscalar processing via a combined static/dynamic register renaming
scheme

Eric Sprangle, Yale Patt

November 1994 Proceedings of the 27th annual international symposium on 

Microarchitecture 
Publisher: ACM Press 

Full text available: pdf Additional Information: full citation, references, citings, index terms



Keywords: out-of-order execution, predicated execution, register renaming, superscalar 
processors 



3 Instruction-level parallelism from execution interlock collapsing
Nadeem Malik, Richard J. Eickemeyer, Stamatis Vassiliadis
September 1992 ACM SIGARCH Computer Architecture News, volume 20 issue 4
Publisher: ACM Press 

Full text available: pdf(579.86 KB) Additional Information: full citation, abstract, citings, index terms

An innovative technique has been developed that permits the collapsing of execution 
interlocks between integer ALU operations as well as between address generation 
operations, allowing parallel execution of two instructions, having true dependencies, in a 
single cycle. Given that the proposed scheme has been shown not to increase the 
machine cycle time, it potentially provides an attractive means for increasing the 

Nadeem Malik, Richard J. Eickemeyer, Stamatis Vassiliadis

December 1992 ACM SIGMICRO Newsletter , Proceedings of the 25th annual

international symposium on Microarchitecture MICRO 25, volume 23 issue 

1-2 

Publisher: IEEE Computer Society Press, ACM Press 

Full text available: pdf(1.12 MB) Additional Information: full citation, references, citings, index terms



5 Limitation of superscalar microprocessor performance
Thang Tran, Chuan-lin Wu 

December 1992 ACM SIGMICRO Newsletter , Proceedings of the 25th annual 

international symposium on Microarchitecture MICRO 25, volume 23 issue 

1-2

Publisher: IEEE Computer Society Press, ACM Press 

Full text available: pdf Additional Information: full citation, references, citings, index terms



6 Tuning compiler optimizations for simultaneous multithreading

Jack L. Lo, Susan J. Eggers, Henry M. Levy, Sujay S. Parekh, Dean M. Tullsen 
December 1997 Proceedings of the 30th annual ACM/IEEE international symposium 
on Microarchitecture 

Publisher: IEEE Computer Society 

Full text available: pdf, Publisher Site Additional Information: full citation, abstract, references, citings, index terms

Compiler optimizations are often driven by specific assumptions about the underlying 
architecture and implementation of the target machine. For example, when targeting 
shared-memory multiprocessors, parallel programs are compiled to minimize sharing, in 
order to decrease high-cost, inter-processor communication. This paper reexamines 
several compiler optimizations in the context of simultaneous multithreading (SMT), a 
processor architecture that issues instructions from multiple threads to the f ... 

Keywords: cache size, compiler optimizations, cyclic algorithm, fine-grained sharing, 
instructions, inter-processor communication, inter-thread instruction-level parallelism, 
latency hiding, loop tiling, loop-iteration scheduling, memory system resources, 
optimising compilers, parallel architecture, parallel programs, performance, processor 
architecture, shared-memory multiprocessors, simultaneous multithreading, software 
speculative execution 



7 Converting thread-level parallelism to instruction-level parallelism via simultaneous
multithreading

Jack L. Lo, Joel S. Emer, Henry M. Levy, Rebecca L. Stamm, Dean M. Tullsen, S. J. Eggers
August 1997 ACM Transactions on Computer Systems (TOCS), volume 15 issue 3 

Publisher: ACM Press 

Full text available: pdf(526.39 KB) Additional Information: full citation, abstract, references, citings, index
terms, review

To achieve high performance, contemporary computer systems rely on two forms of 
parallelism: instruction-level parallelism (ILP) and thread-level parallelism (TLP). Wide- 
issue super-scalar processors exploit ILP by executing multiple instructions from a single 
program in a single cycle. Multiprocessors (MP) exploit TLP by executing different threads 
in parallel on different processors. Unfortunately, both parallel processing styles statically 
partition processor resources, thus preventing t ... 

Keywords: cache interference, instruction-level parallelism, multiprocessors,
multithreading, simultaneous multithreading, thread-level parallelism

8 Enhanced superscalar hardware: the schedule table

J. K. Pickett, D. G. Meyer 
December 1993 Proceedings of the 1993 ACM/IEEE conference on Supercomputing

Publisher: ACM Press 

Full text available: pdf Additional Information: full citation, references, index terms



9 The case for a single-chip multiprocessor

Kunle Olukotun, Basem A. Nayfeh, Lance Hammond, Ken Wilson, Kunyung Chang

September 1996 ACM SIGPLAN Notices , ACM SIGOPS Operating Systems Review ,

Proceedings of the seventh international conference on Architectural 
support for programming languages and operating systems ASPLOS- 

VII, Volume 31 , 30 Issue 9 , 5 

Publisher: ACM Press 

Full text available: pdf(1.10 MB) Additional Information: full citation, abstract, references, citings, index
terms

Advances in IC processing allow for more microprocessor design options. The increasing 
gate density and cost of wires in advanced integrated circuit technologies require that we 
look for new ways to use their capabilities effectively. This paper shows that in advanced 
technologies it is possible to implement a single-chip multiprocessor in the same area as a 
wide issue superscalar processor. We find that for applications with little parallelism the 
performance of the two microarchitectures is co ... 

10 OHMEGA: a VLSI superscalar processor architecture for numerical applications
Masaitsu Nakajima, Hiraku Nakano, Yasuhiro Nakakura, Tadahiro Yoshida, Yoshiyuki Goi, Yuji 
Nakai, Reiji Segawa, Takeshi Kishida, Hiroshi Kadota 

April 1991 ACM SIGARCH Computer Architecture News , Proceedings of the 18th 

annual international symposium on Computer architecture ISCA '91, volume 

19 Issue 3 

Publisher: ACM Press 

Full text available: pdf(941.13 KB) Additional Information: full citation, references, citings, index terms



11 Initial results on the performance and cost of vector microprocessors
Corinna G. Lee, Derek J. DeVries 

December 1997 Proceedings of the 30th annual ACM/IEEE international symposium 
on Microarchitecture 

Publisher: IEEE Computer Society 

Full text available: Publisher Site Additional Information: full citation, abstract, references, citings, index terms

Increasingly wider superscalar processors are experiencing diminishing performance 
returns while requiring larger portions of die area dedicated to control rather than 
datapath. As an alternative to using these processors to exploit parallelism effectively, we 
are investigating the viability of using single-chip vector microprocessors. This paper 
presents some initial results of our investigation where we compare the performance and 
cost of vector microprocessors to that of aggressive, out-of-or ... 

12 Exploiting instruction level parallelism in processors by caching scheduled groups

Ravi Nair, Martin E. Hopkins

May 1997 ACM SIGARCH Computer Architecture News , Proceedings of the 24th 

annual international symposium on Computer architecture ISCA '97, volume 

25 Issue 2 

Publisher: ACM Press 

Full text available: pdf(2.01 MB) Additional Information: full citation, abstract, references, citings, index terms

Modern processors employ a large amount of hardware to dynamically detect parallelism 
in single-threaded programs and maintain the sequential semantics implied by these 
programs. The complexity of some of this hardware diminishes the gains due to 
parallelism because of longer clock period or increased pipeline latency of the machine. In 
this paper we propose a processor implementation which dynamically schedules groups of 
instructions while executing them on a fast simple engine and caches them f ... 

13 A comparison of three current superscalar designs 
Michael Laird 

June 1992 ACM SIGARCH Computer Architecture News, volume 20 issue 3
Publisher: ACM Press 

Full text available: pdf(824.41 KB) Additional Information: full citation, abstract, index terms

A standardized view of superscalar architectures is presented, and three current 
superscalar designs are compared using this framework. The designs studied are the
Metaflow Lightning SPARC, the IBM RS/6000, and the Intel i960MM.

14 Performance evaluation of the PowerPC 620 microarchitecture

Trung A. Diep, Christopher Nelson, John Paul Shen
May 1995 ACM SIGARCH Computer Architecture News , Proceedings of the 22nd

annual international symposium on Computer architecture ISCA '95, volume 

23 Issue 2 

Publisher: ACM Press 

Full text available: pdf Additional Information: full citation, abstract, references, citings, index
terms

The PowerPC 620™ microprocessor is the most recent and performance leading 
member of the PowerPC™ family. The 64-bit PowerPC 620 microprocessor employs 
a two-phase branch prediction scheme, dynamic renaming for all the register files, 
distributed multi-entry reservation stations, true out-of-order execution by six execution 
units, and a completion buffer for ensuring precise exceptions. This paper presents an 
instruction-level performance evaluation of the 620 microarchitectu ... 

15 Improving superscalar instruction dispatch and issue by exploiting dynamic code
sequences

Sriram Vajapeyam, Tulika Mitra

May 1997 ACM SIGARCH Computer Architecture News , Proceedings of the 24th 

annual international symposium on Computer architecture ISCA '97, volume 

25 Issue 2 

Publisher: ACM Press 

Full text available: pdf(1.76 MB) Additional Information: full citation, abstract, references, citings, index
terms

Superscalar processors currently have the potential to fetch multiple basic blocks per 
cycle by employing one of several recently proposed instruction fetch mechanisms. 
However, this increased fetch bandwidth cannot be exploited unless pipeline stages 
further downstream correspondingly improve. In particular, register renaming a large 
number of instructions per cycle is difficult. A large instruction window, needed to receive 
multiple basic blocks per cycle, will slow down dependence resolution ... 

16 Exploiting choice: instruction fetch and issue on an implementable simultaneous 
multithreading processor

Dean M. Tullsen, Susan J. Eggers, Joel S. Emer, Henry M. Levy, Jack L. Lo, Rebecca L.
Stamm 

May 1996 ACM SIGARCH Computer Architecture News , Proceedings of the 23rd 

annual international symposium on Computer architecture ISCA '96, volume 

24 Issue 2 

Publisher: ACM Press 

Simultaneous multithreading is a technique that permits multiple independent threads to 
issue multiple instructions each cycle. In previous work we demonstrated the performance 
potential of simultaneous multithreading, based on a somewhat idealized model. In this 
paper we show that the throughput gains from simultaneous multithreading can be 
achieved without extensive changes to a conventional wide-issue superscalar, either in 
hardware structures or sizes. We present an architecture for s ... 

17 Implementation trade-offs in using a restricted data flow architecture in a high
performance RISC microprocessor

M. Simone, A. Essen, A. Ike, A. Krishnamoorthy, T. Maruyama, N. Patkar, M. Ramaswami, M.
Shebanow, V. Thirumalaiswamy, D. Tovey 

May 1995 ACM SIGARCH Computer Architecture News , Proceedings of the 22nd 

annual international symposium on Computer architecture ISCA '95, volume 

23 Issue 2 

Publisher: ACM Press 

Full text available: pdf Additional Information: full citation, abstract, references, citings, index terms

The implementation of a superscalar, speculative execution SPARC-V9 microprocessor 
incorporating Restricted Data Flow principles required many design trade-offs. 
Consideration was given to both performance and cost. Performance is largely a function 
of cycle time and instructions executed per cycle while cost is primarily a function of die 
area. Here we describe our Restricted Data Flow implementation and the means with 
which we arrived at its configuration. Future semiconductor technology advan ... 

18 Complexity-effective superscalar processors 
Subbarao Palacharla, Norman P. Jouppi, J. E. Smith 

May 1997 ACM SIGARCH Computer Architecture News , Proceedings of the 24th

annual international symposium on Computer architecture ISCA '97, volume 

25 Issue 2 

Publisher: ACM Press 

Full text available: pdf Additional Information: full citation, abstract, references, citings, index terms

The performance tradeoff between hardware complexity and clock speed is studied. First, 
a generic superscalar pipeline is defined. Then the specific areas of register renaming, 
instruction window wakeup and selection logic, and operand bypassing are analyzed. Each 
is modeled and Spice simulated for feature sizes of 0.8µm, 0.35µm, and 
0.18µm. Performance results and trends are expressed in terms of issue width and 
window size. Our analysis indicates that window wakeu ... 

19 The expandable split window paradigm for exploiting fine-grain parallelism
Manoj Franklin, Gurindar S. Sohi

April 1992 ACM SIGARCH Computer Architecture News , Proceedings of the 19th

annual international symposium on Computer architecture ISCA '92, volume 

20 Issue 2 

Publisher: ACM Press 

Full text available: pdf(1.25 MB) Additional Information: full citation, abstract, references, citings, index
terms

We propose a new processing paradigm, called the Expandable Split Window (ESW) 
paradigm, for exploiting fine-grain parallelism. This paradigm considers a window of 
instructions (possibly having dependencies) as a single unit, and exploits fine-grain 
parallelism by overlapping the execution of multiple windows. The basic idea is to connect 
multiple sequential processors, in a decoupled and decentralized manner, to achieve 
overall multiple issue. This processing paradigm shares a number of pr ... 

20 RISC versus CISC: a tale of two chips 
Dileep Bhandarkar 

Full text available: pdf(771.63 KB) Additional Information: full citation, abstract, index terms

This paper compares an aggressive RISC and CISC implementation built with comparable 
technology. The two chips are the Alpha* 21164 and the Intel Pentium® Pro 
processor. The paper presents performance comparisons for industry standard 
benchmarks and uses performance counter statistics to compare various aspects of both 
designs. 

1 Converting thread-level parallelism to instruction-level parallelism via simultaneous
multithreading

Jack L. Lo, Joel S. Emer, Henry M. Levy, Rebecca L. Stamm, Dean M. Tullsen, S. J. Eggers 
August 1997 ACM Transactions on Computer Systems (TOCS), volume 15 issue 3

Publisher: ACM Press 

Full text available: pdf(526.39 KB) Additional Information: full citation, abstract, references, citings, index
terms, review

To achieve high performance, contemporary computer systems rely on two forms of 
parallelism: instruction-level parallelism (ILP) and thread-level parallelism (TLP). Wide- 
issue super-scalar processors exploit ILP by executing multiple instructions from a single 
program in a single cycle. Multiprocessors (MP) exploit TLP by executing different threads 
in parallel on different processors. Unfortunately, both parallel processing styles statically 
partition processor resources, thus preventing t ... 

Keywords: cache interference, instruction-level parallelism, multiprocessors, 
multithreading, simultaneous multithreading, thread-level parallelism 



2 The case for a single-chip multiprocessor

Kunle Olukotun, Basem A. Nayfeh, Lance Hammond, Ken Wilson, Kunyung Chang

September 1996 ACM SIGPLAN Notices , ACM SIGOPS Operating Systems Review ,

Proceedings of the seventh international conference on Architectural 
support for programming languages and operating systems ASPLOS- 

VII, Volume 31 , 30 Issue 9 , 5 

Publisher: ACM Press 

Full text available: pdf(1.10 MB) Additional Information: full citation, abstract, references, citings, index
terms

Advances in IC processing allow for more microprocessor design options. The increasing 
gate density and cost of wires in advanced integrated circuit technologies require that we 
look for new ways to use their capabilities effectively. This paper shows that in advanced 
technologies it is possible to implement a single-chip multiprocessor in the same area as a 
wide issue superscalar processor. We find that for applications with little parallelism the 
performance of the two microarchitectures is co ... 

3 Exploiting choice: instruction fetch and issue on an implementable simultaneous
multithreading processor

Dean M. Tullsen, Susan J. Eggers, Joel S. Emer, Henry M. Levy, Jack L. Lo, Rebecca L.
Stamm

May 1996 ACM SIGARCH Computer Architecture News , Proceedings of the 23rd

annual international symposium on Computer architecture ISCA '96, volume 

24 Issue 2 

Publisher: ACM Press 

Full text available: pdf(1.43 MB) Additional Information: full citation, abstract, references, citings, index terms

Simultaneous multithreading is a technique that permits multiple independent threads to
issue multiple instructions each cycle. In previous work we demonstrated the performance 
potential of simultaneous multithreading, based on a somewhat idealized model. In this 
paper we show that the throughput gains from simultaneous multithreading can be 
achieved without extensive changes to a conventional wide-issue superscalar, either in 
hardware structures or sizes. We present an architecture for s ... 

4 Tuning compiler optimizations for simultaneous multithreading

Jack L. Lo, Susan J. Eggers, Henry M. Levy, Sujay S. Parekh, Dean M. Tullsen 

December 1997 Proceedings of the 30th annual ACM/IEEE international symposium

on Microarchitecture 
Publisher: IEEE Computer Society 

Full text available: pdf, Publisher Site Additional Information: full citation, abstract, references, citings, index terms

Compiler optimizations are often driven by specific assumptions about the underlying 
architecture and implementation of the target machine. For example, when targeting 
shared-memory multiprocessors, parallel programs are compiled to minimize sharing, in 
order to decrease high-cost, inter-processor communication. This paper reexamines 
several compiler optimizations in the context of simultaneous multithreading (SMT), a 
processor architecture that issues instructions from multiple threads to the f ... 

Keywords: cache size, compiler optimizations, cyclic algorithm, fine-grained sharing, 
instructions, inter-processor communication, inter-thread instruction-level parallelism, 
latency hiding, loop tiling, loop-iteration scheduling, memory system resources, 
optimising compilers, parallel architecture, parallel programs, performance, processor 
architecture, shared-memory multiprocessors, simultaneous multithreading, software 
speculative execution 

5 Improving superscalar instruction dispatch and issue by exploiting dynamic code
sequences

Sriram Vajapeyam, Tulika Mitra

May 1997 ACM SIGARCH Computer Architecture News , Proceedings of the 24th

annual international symposium on Computer architecture ISCA '97, volume

25 Issue 2 

Publisher: ACM Press 

Superscalar processors currently have the potential to fetch multiple basic blocks per 
cycle by employing one of several recently proposed instruction fetch mechanisms. 
However, this increased fetch bandwidth cannot be exploited unless pipeline stages 
further downstream correspondingly improve. In particular, register renaming a large 
number of instructions per cycle is difficult. A large instruction window, needed to receive 
multiple basic blocks per cycle, will slow down dependence resolution ... 

6 Limitation of superscalar microprocessor performance 

Thang Tran, Chuan-lin Wu 
December 1992 ACM SIGMICRO Newsletter , Proceedings of the 25th annual

international symposium on Microarchitecture MICRO 25, volume 23 issue 

1-2 

Publisher: IEEE Computer Society Press, ACM Press 